On PAC-Bayesian Bounds for Random Forests
Existing rigorous upper bounds on the generalization error of the original
random forest algorithm, one of the most frequently used machine learning
methods, are unsatisfying. We discuss and evaluate various
PAC-Bayesian approaches to derive such bounds. The bounds do not require
additional hold-out data, because the out-of-bag samples from the bagging in
the training process can be exploited. A random forest predicts by taking a
majority vote of an ensemble of decision trees. The first approach is to bound
the error of the vote by twice the error of the corresponding Gibbs classifier
(classifying with a single member of the ensemble selected at random). However,
this approach does not account for the averaging out of the individual
classifiers' errors when taking the majority vote. This effect provides a
significant boost in performance when the errors are independent or negatively
correlated, but when the correlations are strong the advantage from taking the
majority vote is small. The second approach based on PAC-Bayesian C-bounds
takes dependencies between ensemble members into account, but it requires
estimating correlations between the errors of the individual classifiers. When
the correlations are high or the estimation is poor, the bounds degrade. In our
experiments, we compute generalization bounds for random forests on various
benchmark data sets. Because the individual decision trees already perform
well, their predictions are highly correlated and the C-bounds do not lead to
satisfactory results. For the same reason, the bounds based on the analysis of
Gibbs classifiers are typically superior and often reasonably tight. Bounds
based on a validation set, which comes at the cost of a smaller training set,
gave better performance guarantees, but worse predictive performance in most
experiments.
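The gap between the two bounding approaches can be illustrated numerically. The following is a minimal sketch (not the paper's code) comparing the Gibbs error with the majority-vote error for a hypothetical ensemble whose members err independently; all parameters (m, n, p) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: m voters on n test points, true label +1 everywhere;
# each voter errs independently with probability p.
m, n, p = 25, 10_000, 0.4
votes = np.where(rng.random((m, n)) < p, -1, 1)  # -1 marks an error

# Gibbs error: expected error of a single ensemble member drawn at random.
gibbs_error = np.mean(votes == -1)

# Majority-vote error: the ensemble errs when more than half the voters err.
majority_error = np.mean(votes.sum(axis=0) < 0)

# The first PAC-Bayesian approach bounds the majority-vote error by twice
# the Gibbs error; with independent errors the vote does much better.
print(gibbs_error, majority_error)
```

With independent errors the majority vote is far below twice the Gibbs error, which is exactly the averaging effect the factor-2 bound ignores; when the voters' errors are strongly correlated, the two quantities move close together.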
Information Bottleneck: Exact Analysis of (Quantized) Neural Networks
The information bottleneck (IB) principle has been suggested as a way to
analyze deep neural networks. The learning dynamics are studied by inspecting
the mutual information (MI) between the hidden layers and the input and output.
Notably, separate fitting and compression phases during training have been
reported. This led to some controversy including claims that the observations
are not reproducible and strongly dependent on the type of activation function
used as well as on the way the MI is estimated. Our study confirms that
different ways of binning when computing the MI lead to qualitatively different
results, either supporting or refuting the IB conjectures. To resolve the
controversy, we study the IB principle in settings where MI is non-trivial and
can be computed exactly. We monitor the dynamics of quantized neural networks,
that is, we discretize the whole deep learning system so that no approximation
is required when computing the MI. This allows us to quantify the information
flow without measurement errors. In this setting, we observed a fitting phase
for all layers and a compression phase for the output layer in all experiments;
the compression in the hidden layers was dependent on the type of activation
function. Our study shows that the initial IB results were not artifacts of
binning when computing the MI. However, the critical claim that the compression
phase may not be observed for some networks also holds true.
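To make concrete what "computing the MI exactly" means for discretized variables: once every quantity in the system takes finitely many values, the mutual information follows directly from joint frequency counts, with no binning or density estimation. The helper below is an illustrative sketch, not the study's implementation:

```python
import numpy as np

def mutual_information(x, y):
    """Exact MI (in bits) between two discrete sequences via joint counts."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x
    py = joint.sum(axis=0, keepdims=True)   # marginal of y
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Toy check: a deterministic map of a uniform 2-bit input.
x = np.array([0, 1, 2, 3] * 100)
y = x % 2  # a "layer" that discards one bit
print(mutual_information(x, x))  # prints 2.0
print(mutual_information(x, y))  # prints 1.0
```

In a quantized network, the hidden activations play the role of `y`: because they are discrete, the information flow between input and layer can be tracked over training without measurement error.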
Using machine learning for predicting intensive care unit resource use during the COVID-19 pandemic in Denmark
The COVID-19 pandemic has put massive strains on hospitals, and tools to guide hospital planners in resource allocation during the ebbs and flows of the pandemic are urgently needed. We investigate whether machine learning (ML) can be used to predict intensive care requirements a fixed number of days into the future. In a retrospective design, health records from 42,526 SARS-CoV-2 positive patients in Denmark were extracted. Random Forest (RF) models were trained to predict the risk of ICU admission and the use of mechanical ventilation after n days (n = 1, 2, …, 15). An extended analysis was provided for n = 5 and n = 10. Models predicted the n-day risk of ICU admission with an area under the receiver operating characteristic curve (ROC-AUC) between 0.981 and 0.995, and the n-day risk of use of ventilation with an ROC-AUC between 0.982 and 0.997. The corresponding n-day forecasting models predicted the needed ICU capacity with a coefficient of determination (R²) between 0.334 and 0.989, and the use of ventilation with an R² between 0.446 and 0.973. The forecasting models performed worst when forecasting many days into the future (i.e., for large n). For n = 5, ICU capacity was predicted with ROC-AUC 0.990 and R² 0.928, and use of ventilation with ROC-AUC 0.994 and R² 0.854. Random Forest-based modelling can be used for accurate n-day forecasting of ICU resource requirements, when n is not too large.
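The modelling pipeline described above can be sketched with scikit-learn's RandomForestClassifier. The features, label rule, and sample size below are entirely synthetic stand-ins; the study's health records are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for patient features (e.g. age, comorbidity scores).
n_patients = 5000
X = rng.normal(size=(n_patients, 10))
# Hypothetical label: ICU admission within n days, driven by two features
# plus noise. This rule is invented purely for illustration.
y = (X[:, 0] + 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=n_patients) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Evaluate the n-day risk model the same way the study reports it: ROC-AUC
# on held-out patients, using the predicted probability of the positive class.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC-AUC: {auc:.3f}")
```

Aggregating the per-patient predicted risks over a population would then give the n-day capacity forecast that the study evaluates with R².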